Pandas: for data manipulation and analysis.
NumPy: the fundamental package for scientific computing with Python.
Matplotlib and Seaborn: for plotting and visualization.
Scikit-learn: for data preprocessing techniques and machine learning algorithms.
# Importing required Packages
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# Read Loan Dataset
loan_data=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
# Features in the dataset
loan_data.columns
We have 12 independent variables and 1 target variable, Loan_Status, in the loan_data dataset.
# Shape of the dataset
loan_data.shape
loan_data.head()
loan_data.info()
This is a very important stage in any data science/machine learning pipeline. It involves brainstorming as many factors as possible that could impact the outcome, based on a thorough understanding of the problem statement, and it is done before looking at the data.
Below are some of the factors which I think can affect the Loan Approval (dependent variable for this loan prediction problem):
Salary: Applicants with a high income should have a better chance of loan approval.
Previous history: Applicants who have repaid their previous debts should have a better chance of loan approval.
Loan amount: Approval should also depend on the loan amount; a smaller loan should be easier to approve.
Loan term: A loan with a shorter term and a smaller amount should have a better chance of approval.
EMI: The lower the monthly repayment amount, the higher the chance of approval.
These are some of the factors that I think can affect the target variable; you can come up with many more.
You can download the dataset from: https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
loan_data['Loan_Status'].value_counts().plot.bar()
loan_data['Loan_Status'].value_counts()
# Normalize can be set to True to print proportions instead of number
loan_data['Loan_Status'].value_counts(normalize=True)
The loans of 422 (around 69%) of the 614 applicants were approved.
Now lets visualize each variable separately. Different types of variables are Categorical, ordinal and numerical.
Categorical features: These features have categories (Gender, Married, Self_Employed, Credit_History, Loan_Status)
Ordinal features: Variables in categorical features having some order involved (Dependents, Education, Property_Area)
Numerical features: These features have numerical values (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term)
# Visualizing categorical features
plt.figure(1)
plt.subplot(221)
loan_data['Gender'].value_counts(normalize=True).plot.bar(figsize=(20,10), title= 'Gender')
plt.subplot(222)
loan_data['Married'].value_counts(normalize=True).plot.bar(title= 'Married')
plt.subplot(223)
loan_data['Dependents'].value_counts(normalize=True).plot.bar(title= 'Dependents')
plt.subplot(224)
loan_data['Education'].value_counts(normalize=True).plot.bar(title= 'Education')
plt.show()
It can be inferred from the above bar plots that:
80% of the applicants in the dataset are male. Around 65% of the applicants are married. Most of the applicants have no dependents. Around 80% of the applicants are graduates.
# Visualizing more categorical and ordinal features
plt.figure(1)
plt.subplot(131)
loan_data['Self_Employed'].value_counts(normalize=True).plot.bar(figsize=(24,6), title= 'Self_Employed')
plt.subplot(132)
loan_data['Credit_History'].value_counts(normalize=True).plot.bar(title= 'Credit_History')
plt.subplot(133)
loan_data['Property_Area'].value_counts(normalize=True).plot.bar(title= 'Property_Area')
plt.show()
The following inferences can be made from the above bar plots:
Around 15% of the applicants are self-employed. Around 85% of the applicants have repaid their previous debts (Credit_History = 1). Most of the applicants are from the semiurban area.
# Visualizing numerical features
plt.figure(1)
plt.subplot(121)
sns.distplot(loan_data['ApplicantIncome']);  # Note: distplot is deprecated in newer seaborn; use histplot/displot there
plt.subplot(122)
loan_data['ApplicantIncome'].plot.box(figsize=(16,5))
plt.show()
It can be inferred that most of the mass of the applicant income distribution is towards the left with a long right tail, i.e. it is right-skewed rather than normally distributed. We will try to make it normal in later sections, as many algorithms work better when the data is normally distributed.
The boxplot confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
loan_data.boxplot(column='ApplicantIncome', by = 'Education')
We can see that there is a higher number of graduates with very high incomes, which appear to be the outliers.
plt.figure(1)
plt.subplot(121)
sns.distplot(loan_data['CoapplicantIncome']);
plt.subplot(122)
loan_data['CoapplicantIncome'].plot.box(figsize=(16,5))
plt.show()
We see a similar distribution to that of the applicant income. The majority of coapplicants' incomes range from 0 to 5000. We also see a lot of outliers in the coapplicant income, and it is not normally distributed.
Let’s look at the distribution of LoanAmount variable.
plt.figure(1)
plt.subplot(121)
df=loan_data.dropna()
sns.distplot(df['LoanAmount']);
plt.subplot(122)
loan_data['LoanAmount'].plot.box(figsize=(16,5))
plt.show()
We see a lot of outliers in this variable, although the distribution is fairly normal. We will treat the outliers in later sections.
Now we would like to know how well each feature correlates with Loan_Status, so in the next section we will look at bivariate analysis.
print(pd.crosstab(loan_data['Gender'],loan_data['Loan_Status']))
Gender=pd.crosstab(loan_data['Gender'],loan_data['Loan_Status'])
Gender.div(Gender.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))
plt.xlabel('Gender')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Married'],loan_data['Loan_Status']))
Married=pd.crosstab(loan_data['Married'],loan_data['Loan_Status'])
Married.div(Married.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))
plt.xlabel('Married')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Dependents'],loan_data['Loan_Status']))
Dependents=pd.crosstab(loan_data['Dependents'],loan_data['Loan_Status'])
Dependents.div(Dependents.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('Dependents')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Education'],loan_data['Loan_Status']))
Education=pd.crosstab(loan_data['Education'],loan_data['Loan_Status'])
Education.div(Education.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))
plt.xlabel('Education')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Self_Employed'],loan_data['Loan_Status']))
Self_Employed=pd.crosstab(loan_data['Self_Employed'],loan_data['Loan_Status'])
Self_Employed.div(Self_Employed.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))
plt.xlabel('Self_Employed')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Credit_History'],loan_data['Loan_Status']))
Credit_History=pd.crosstab(loan_data['Credit_History'],loan_data['Loan_Status'])
Credit_History.div(Credit_History.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True, figsize=(4,4))
plt.xlabel('Credit_History')
p = plt.ylabel('Percentage')
print(pd.crosstab(loan_data['Property_Area'],loan_data['Loan_Status']))
Property_Area=pd.crosstab(loan_data['Property_Area'],loan_data['Loan_Status'])
Property_Area.div(Property_Area.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('Property_Area')
p = plt.ylabel('Percentage')
# Making bins for the ApplicantIncome variable
bins=[0,2500,4000,6000,81000]
group=['Low','Average','High', 'Very high']
loan_data['Income_bin']=pd.cut(loan_data['ApplicantIncome'],bins,labels=group)
Income_bin=pd.crosstab(loan_data['Income_bin'],loan_data['Loan_Status'])
Income_bin.div(Income_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('ApplicantIncome')
p = plt.ylabel('Percentage')
It can be inferred that applicant income does not strongly affect the chances of loan approval, which contradicts our hypothesis that a higher applicant income would mean a higher chance of approval.
# Making bins for the CoapplicantIncome variable
bins=[0,1000,3000,42000]
group=['Low','Average','High']
loan_data['Coapplicant_Income_bin']=pd.cut(loan_data['CoapplicantIncome'],bins,labels=group)
Coapplicant_Income_bin=pd.crosstab(loan_data['Coapplicant_Income_bin'],loan_data['Loan_Status'])
Coapplicant_Income_bin.div(Coapplicant_Income_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('CoapplicantIncome')
p = plt.ylabel('Percentage')
# Making bins for the LoanAmount variable
bins=[0,100,200,700]
group=['Low','Average','High']
loan_data['LoanAmount_bin']=pd.cut(loan_data['LoanAmount'],bins,labels=group)
LoanAmount_bin=pd.crosstab(loan_data['LoanAmount_bin'],loan_data['Loan_Status'])
LoanAmount_bin.div(LoanAmount_bin.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
plt.xlabel('LoanAmount')
p = plt.ylabel('Percentage')
# Drop the new variable of bins
loan_data=loan_data.drop(['Income_bin', 'Coapplicant_Income_bin', 'LoanAmount_bin'], axis=1)
# # Replacing 3+ in the Dependents variable with 3, and
# # Y/N in the Loan_Status variable with 1/0 respectively
# loan_data['Dependents'].replace(('0', '1', '2', '3+'), (0, 1, 2, 3), inplace=True)
loan_data['Dependents'].value_counts()
# loan_data['Loan_Status'].replace('N', 0, inplace=True)
# loan_data['Loan_Status'].replace('Y', 1, inplace=True)
# Print correlation matrix
matrix = loan_data.corr(numeric_only=True)  # numeric_only avoids errors on string columns in newer pandas
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(matrix, vmax=.8, square=True, cmap="BuPu");
# Note: pandas_profiling has since been renamed to ydata-profiling
import pandas_profiling
pandas_profiling.ProfileReport(loan_data)
# Checking the missing values
loan_data.isnull().sum()
There are missing values in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term and Credit_History features.
We will treat the missing values in all the features one by one.
We can consider these methods to fill the missing values:
For numerical variables: imputation using mean or median.
For categorical variables: imputation using mode.
There are very few missing values in the Gender, Married, Dependents, Credit_History and Self_Employed features, so we can fill them with the mode of each feature.
# replacing the missing values with the mode
loan_data['Gender'].fillna(loan_data['Gender'].mode()[0], inplace=True)
loan_data['Married'].fillna(loan_data['Married'].mode()[0], inplace=True)
loan_data['Dependents'].fillna(loan_data['Dependents'].mode()[0], inplace=True)
loan_data['Self_Employed'].fillna(loan_data['Self_Employed'].mode()[0], inplace=True)
loan_data['Credit_History'].fillna(loan_data['Credit_History'].mode()[0], inplace=True)
loan_data['Loan_Amount_Term'].value_counts()
In the Loan_Amount_Term variable, the value 360 repeats most often, so we will replace the missing values in this variable with its mode.
loan_data['Loan_Amount_Term'].fillna(loan_data['Loan_Amount_Term'].mode()[0], inplace=True)
Now we will look at the LoanAmount variable. As it is numerical, we can use the mean or median to impute the missing values. We will use the median to fill the null values because, as we saw earlier, LoanAmount has outliers and the mean is highly affected by the presence of outliers.
# Replace missing values
loan_data['LoanAmount'].fillna(loan_data['LoanAmount'].median(), inplace=True)
# ## Using sklearn's imputer (Imputer was replaced by SimpleImputer in sklearn.impute)
# from sklearn.impute import SimpleImputer
# imp = SimpleImputer(strategy='mean')
# X2 = imp.fit_transform(X)
Now let's check whether all the missing values in the dataset have been filled.
loan_data.isnull().sum()
Outlier Treatment
Due to these outliers, the bulk of the LoanAmount data sits on the left and the right tail is longer; this is called right skewness. One way to remove the skewness is a log transformation. The log barely affects smaller values but shrinks larger ones, so we get a distribution much closer to normal.
Let’s visualize the effect of log transformation. We will do the similar changes to the test file simultaneously.
# Removing skewness in LoanAmount variable by log transformation
loan_data['LoanAmount_log'] = np.log(loan_data['LoanAmount'])
# loan_data['LoanAmount_log'].hist(bins=20)
test['LoanAmount_log'] = np.log(test['LoanAmount'])
Now the distribution looks much closer to normal and the effect of the extreme values has been significantly subdued. Let's build a logistic regression model and make predictions for the test dataset.
Curse of Dimensionality - http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/
PCA (Principal Component Analysis) is mainly used to reduce the size of the feature space while retaining as much of the information as possible. Here, all the features are transformed into 2 features using PCA.
PCA Code:
# from sklearn.decomposition import PCA
# X = df.drop(['Class'], axis = 1)
# y = df['Class']
# pca = PCA(n_components=2)
# principalComponents = pca.fit_transform(X.values)
# principalDf = pd.DataFrame(data = principalComponents,
#                            columns = ['principal component 1', 'principal component 2'])
# finalDf = pd.concat([principalDf, y], axis = 1)
# finalDf.head()
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on.
Benefits of performing feature selection before modeling your data include:
Reduces Complexity: It reduces the complexity of a model and makes it easier to interpret.
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: It enables the machine learning algorithm to train faster.
Methods of Feature Selection:
https://ashutoshtripathi.com/2019/06/07/feature-selection-techniques-in-regression-model/
1. Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and either kept or removed from the dataset. These methods are often univariate and consider each feature independently, or with regard to the dependent variable.
Some examples of filter methods include the chi-squared test, information gain, and correlation coefficient scores.
1. Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship with the output variable.
2. Feature Importance
Feature importance gives you a score for each feature of your data: the higher the score, the more important or relevant the feature is to your output variable.
3. Correlation Matrix with Heatmap
Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in one feature's value increases the value of the target variable) or negative (an increase in one feature's value decreases the value of the target variable).
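For instance, a minimal univariate filter using scikit-learn's SelectKBest with the chi-squared test might look like this; the tiny DataFrame below is invented for illustration, not taken from the loan dataset:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Invented toy frame: two candidate features and a binary target
X = pd.DataFrame({
    'Credit_History':  [1, 1, 0, 1, 0, 1, 1, 0],
    'ApplicantIncome': [5000, 3000, 2500, 6000, 1800, 4500, 5200, 2000],
})
y = [1, 1, 0, 1, 0, 1, 1, 0]

# chi2 requires non-negative features; rank them and keep the k best-scoring ones
selector = SelectKBest(score_func=chi2, k=1)
X_new = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])
```

On real data you would pass the full feature matrix and inspect `selector.scores_` before deciding on k.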
2. Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign a score based on model accuracy.
The search process may be methodical, such as a best-first search; it may be stochastic, such as a random hill-climbing algorithm; or it may use heuristics, like forward and backward passes to add and remove features.
An example of a wrapper method is the recursive feature elimination algorithm.
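A minimal recursive feature elimination (RFE) sketch on synthetic data; the dataset and parameter choices here are illustrative, not from the original notebook:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the loan features
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=42)

# Recursively drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask over the 8 features
print(rfe.ranking_)   # rank 1 = selected
```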
3. Embedded Methods
Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and ridge regression, which have built-in penalization functions to reduce overfitting.
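As a rough illustration of an embedded method, an L1-penalized logistic regression zeroes out weak coefficients during the fit itself; the synthetic data and regularization strength below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# The L1 penalty drives weak coefficients to exactly zero,
# so feature selection happens inside the model fit itself
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
clf.fit(X, y)
print((clf.coef_ != 0).sum(), 'of', X.shape[1], 'features kept')
```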
https://ashutoshtripathi.com/2019/06/13/what-is-multicollinearity/
https://github.com/bhattbhavesh91/chi-squared-feature-selection-selectkbest
Based on the domain knowledge, we can come up with new features that might affect the target variable. We will create the following three new features:
Total Income - As discussed during bivariate analysis, we will combine the applicant income and coapplicant income. If the total income is high, the chances of loan approval might also be high.
EMI - EMI is the monthly amount to be paid by the applicant to repay the loan. The idea behind this variable is that applicants with a high EMI might find it difficult to pay back the loan. We can calculate the EMI as the ratio of the loan amount to the loan amount term.
Balance Income - This is the income left after the EMI has been paid. The idea is that if this value is high, the person is more likely to repay the loan, which increases the chances of loan approval.
# Total_Income feature
loan_data['Total_Income']=loan_data['ApplicantIncome']+loan_data['CoapplicantIncome']
sns.distplot(loan_data['Total_Income']);
loan_data.head()
# Loan amount to total income (LTV) ratio feature
loan_data['LTV']=loan_data['LoanAmount']/loan_data['Total_Income']
sns.distplot(loan_data['LTV']);
loan_data['Total_Income_log'] = np.log(loan_data['Total_Income'])
sns.distplot(loan_data['Total_Income_log']);
# EMI feature
loan_data['EMI']=(loan_data['LoanAmount']/loan_data['Loan_Amount_Term'])*1000
sns.distplot(loan_data['EMI']);
# Balance Income feature
loan_data['Balance Income']=loan_data['Total_Income']-(loan_data['EMI'])
sns.distplot(loan_data['Balance Income']);
Let's drop the Loan_ID variable, as it does not have any effect on the loan status. We will make the same changes to the test dataset that we made to the training dataset.
loan_data=loan_data.drop('Loan_ID',axis=1)
Sklearn requires the target variable in a separate dataset. So, we will drop our target variable from the loan_data dataset and save it in another dataset.
x = loan_data.drop('Loan_Status', axis=1)
y = loan_data.Loan_Status
Now we will make dummy variables for the categorical variables. Dummy variables turn a categorical variable into a series of 0/1 columns, making it a lot easier to quantify and compare. Let us first understand the process:
Consider the "Gender" variable. It has two classes, Male and Female. As logistic regression takes only numerical values as input, we have to convert Male and Female into numbers. Applying dummies converts the "Gender" variable into two variables (Gender_Male and Gender_Female), one for each class: Gender_Male will have a value of 0 if the gender is Female and 1 if the gender is Male.
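A tiny invented sample shows what pd.get_dummies produces for such a variable:

```python
import pandas as pd

# Invented three-row sample, just to show the encoding
sample = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})
dummies = pd.get_dummies(sample)
print(dummies)  # columns: Gender_Female, Gender_Male
```

Passing drop_first=True would keep only one of the two columns, which avoids redundancy for models sensitive to collinearity.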
Log transformation - compresses higher values more than smaller ones.
Positively skewed data:
If the tail is on the right, it is right-skewed data, also called positively skewed data.
Common transformations of this data include square root, cube root, and log.
Cube root transformation:
The cube root transformation converts x to x^(1/3). It is a fairly strong transformation with a substantial effect on distribution shape, but weaker than the logarithm, and it can also be applied to negative and zero values.
Square root transformation:
Applied to positive values only, so check the column's values before applying it.
Logarithm transformation:
The logarithm, x to log base 10 of x, or x to log base e of x (ln x), or x to log base 2 of x, is a strong transformation and can be used to reduce right skewness.
Negatively skewed data:
If the tail is on the left, it is called left-skewed data, also called negatively skewed data.
Common transformations include the square, cube root and logarithm.
We will discuss the square transformation, as the others have already been covered.
Square transformation:
The square, x to x², has a moderate effect on distribution shape and it could be used to reduce left skewness.
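A quick sanity check of these claims on simulated right-skewed (log-normal) data, using scipy's skew; the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # strongly right-skewed

print(round(skew(x), 2))           # large positive skew
print(round(skew(np.sqrt(x)), 2))  # milder
print(round(skew(np.log(x)), 2))   # near zero: log of log-normal is normal
```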
Links: https://medium.com/@TheDataGyan/day-8-data-transformation-skewness-normalization-and-much-more-4c144d370e55 and http://fmwww.bc.edu/repec/bocode/t/transint.html
f, (ax1, ax2) = plt.subplots(2,1,figsize =( 15, 8))
sns.kdeplot(loan_data['LoanAmount'],shade=True, ax = ax1, color='red')
ax1.set_title('Before Log Transformation')
sns.kdeplot(loan_data['LoanAmount_log'],shade=True, ax = ax2, color='blue')
ax2.set_title('After Log Transformation')
plt.show()
Feature scaling is a method used to standardize the range of independent variables or features of data.
The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges.
Examples of algorithms where feature scaling matters:
1. K-Means uses the Euclidean distance measure, so feature scaling matters.
2. K-Nearest Neighbours also requires feature scaling.
3. Principal Component Analysis (PCA) looks for the directions of maximum variance, so feature scaling is required here too.
4. Gradient descent converges faster after feature scaling.
Note: Naive Bayes, Linear Discriminant Analysis, and tree-based models are not affected by feature scaling. In short, any algorithm that is not distance-based is not affected by feature scaling.
Types of Feature Scaling :
Min-Max scaling (normalization): rescales each feature to the range 0 to 1.
# Feature Scaling - Min-max Scaling - Example
# Creating DataFrame first
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=(range(6)))
s2 = pd.Series([10, 9, 8, 7, 6, 5], index=(range(6)))
df = pd.DataFrame(s1, columns=['s1'])
df['s2'] = s2
df
# Min-max scaling using mlxtend's minmax_scaling helper
from mlxtend.preprocessing import minmax_scaling
minmax_scaling(df, columns=['s1', 's2'])
Standardization:
Split the data first, then standardize the test data using the parameters learned from the training data.
A common standardization mistake is to normalize the entire dataset and then split it into train and test sets. This is incorrect: test data should be completely unseen during modeling. We should therefore fit the scaler on the training data and use its summary statistics (the mean and variance of each feature) to standardize the test data.
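A minimal sketch of this correct order of operations with scikit-learn's StandardScaler; random data stands in for the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random data stands in for the real feature matrix
X = np.random.default_rng(1).normal(loc=10, scale=3, size=(100, 3))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=5)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```

Note that only `fit_transform` is called on the training split; the test split sees `transform` alone.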
One-hot encoding: encoding for nominal (unordered) data.
#### Using pandas get Dummies
x=pd.get_dummies(x)
loan_data=pd.get_dummies(loan_data)
# #### Using sklearn OneHotEncoder
# from sklearn.preprocessing import OneHotEncoder
# encoder = OneHotEncoder()
# loan_data_1hot = encoder.fit_transform(loan_data)
# loan_data_1hot
Label encoding: encoding for ordinal (ordered) data.
# #### Using sklearn LabelEncoder
# from sklearn import preprocessing
# # The label_encoder object knows how to understand word labels
# label_encoder = preprocessing.LabelEncoder()
# # Encode the labels in a single column, e.g. 'Education'
# loan_data['Education'] = label_encoder.fit_transform(loan_data['Education'])
Train file will be used for training the model, i.e. our model will learn from this file.
It contains all the independent variables and the target variable.
Test file contains all the independent variables, but not the target variable.
We will apply the model to predict the target variable for the test data.
# Split the data using random sampling
from sklearn.model_selection import train_test_split
X_loan_data,X_test,y_loan_data,y_test= train_test_split(x,y,test_size= 0.15,random_state= 5)
Techniques for handling class imbalance:
1. Under-sampling
2. Over-sampling (up-sampling)
3. SMOTE
4. Class balancing
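SMOTE itself lives in the separate imbalanced-learn package, but plain up-sampling of the minority class can be sketched with scikit-learn's resample utility; the tiny frame below is invented for illustration:

```python
import pandas as pd
from sklearn.utils import resample

# Invented imbalanced frame: six 'Y' rows vs two 'N' rows
df_imb = pd.DataFrame({'x': range(8),
                       'Loan_Status': ['Y'] * 6 + ['N'] * 2})

majority = df_imb[df_imb['Loan_Status'] == 'Y']
minority = df_imb[df_imb['Loan_Status'] == 'N']

# Sample the minority class with replacement up to the majority size
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['Loan_Status'].value_counts())
```

Resampling should be applied to the training split only, for the same reason scaling statistics must come from the training data.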
Key modeling trade-offs to keep in mind:
a. Accuracy vs interpretability
b. Bias & variance
c. Overfitting & underfitting
L2(Ridge Regularization)
L1(Lasso Regularization)
Elastic Net
Lasso regression performs L1 regularization, which adds a penalty equal to the sum of the absolute values of the coefficients.
Ridge regression performs L2 regularization, which adds a penalty equal to the sum of the squares of the coefficients.
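A small sketch of the practical difference on synthetic regression data (illustrative alpha values): L1 can drive coefficients exactly to zero, while L2 only shrinks them:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10,
                       n_informative=3, noise=5, random_state=0)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# L1 zeroes out weak coefficients; L2 only shrinks them
print((lasso.coef_ == 0).sum(), 'exact zeros under L1')
print((ridge.coef_ == 0).sum(), 'exact zeros under L2')
```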
Let us make our first model to predict the target variable. We will start with Logistic Regression which is used for predicting binary outcome.
Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
Logistic regression is an estimation of the logit function, which is simply the log of the odds in favor of the event.
This function produces an S-shaped curve for the probability estimate, very similar to the required step-wise function.
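The S-shaped curve is the logistic (sigmoid) function, which can be sketched directly:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))   # 0.5, the decision boundary
print(sigmoid(4))   # close to 1
print(sigmoid(-4))  # close to 0
```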
# fit model
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression()
classifier.fit(X_loan_data,y_loan_data)
pred= classifier.predict(X_test)
from sklearn import metrics as m
# Accuracy
acc= m.accuracy_score(y_test,pred)
acc
So our predictions are almost 84% accurate, i.e. we have identified 84% of the loan status correctly.
A decision tree is a type of supervised learning algorithm (with a pre-defined target variable) that is mostly used in classification problems. In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.
Decision trees use multiple algorithms to decide whether to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes; in other words, the purity of each node increases with respect to the target variable.
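Purity is commonly measured with the Gini impurity; a small hand-rolled sketch (the label values are illustrative):

```python
def gini(labels):
    """Gini impurity: 1 - sum(p_i^2); 0 means a perfectly pure node."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(['Y', 'Y', 'Y', 'Y']))  # 0.0   -> pure node
print(gini(['Y', 'Y', 'N', 'N']))  # 0.5   -> maximally mixed
print(gini(['Y', 'Y', 'Y', 'N']))  # 0.375 -> in between
```

A split is preferred when it lowers the weighted Gini impurity of the child nodes relative to the parent.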
For detailed explanation visit https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/#six
# fit model
from sklearn import tree
classifier= tree.DecisionTreeClassifier(random_state=1)
classifier.fit(X_loan_data,y_loan_data)
pred= classifier.predict(X_test)
from sklearn import metrics as m
# Accuracy
acc= m.accuracy_score(y_test,pred)
acc
We got an accuracy of 0.70, which is much lower than the accuracy from the logistic regression model. So let's build another model, Random Forest, a tree-based ensemble algorithm, and try to improve the accuracy.
Random forest is a tree-based bootstrapping algorithm wherein a certain number of weak learners (decision trees) are combined to make a powerful prediction model.
For every individual learner, a random sample of rows and a few randomly chosen variables are used to build a decision tree model.
The final prediction can be a function of all the predictions made by the individual learners.
In the case of a regression problem, the final prediction can be the mean of all the predictions.
For detailed explanation visit this article https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
# fit model
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(n_estimators= 700)
classifier.fit(X_loan_data,y_loan_data)
pred= classifier.predict(X_test)
from sklearn import metrics as m
# Accuracy
acc= m.accuracy_score(y_test,pred)
acc
We got an accuracy of 0.81 from the random forest model.
Let us find the feature importances now, i.e. which features are most important for this problem. We will use the feature_importances_ attribute of sklearn to do so.
(pd.Series(classifier.feature_importances_, index=x.columns)
.nlargest(8)
.plot(kind='barh'))
If we have a clear understanding of the problem, we can select an appropriate threshold by maximizing recall or precision.
Recall: Recall is a highly important metric in medical and legal cases, where the consequence of failing to identify a positive case is severe, e.g. failing to identify cancer patients.
Precision: Precision is a highly important metric in customer-facing, machine-driven tasks, where the consequence of a wrong recommendation is severe, e.g. unsafe videos being recommended to children.
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
results = confusion_matrix(y_test,pred)
print ('Confusion Matrix :')
print(results)
print ('Accuracy Score :',accuracy_score(y_test,pred))
print (classification_report(y_test,pred))
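As a sketch of threshold tuning (on synthetic data, not the loan dataset), predict_proba lets us trade precision against recall by moving the cutoff:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Lowering the cutoff raises recall at the cost of precision
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print(f'threshold={threshold}: '
          f'precision={precision_score(y_te, pred):.2f}, '
          f'recall={recall_score(y_te, pred):.2f}')
```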
# GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = [
    # first try 12 (3x4) combinations of hyperparameters
    {'n_estimators': [80, 100, 250], 'max_features': [6, 8, 10, 12]},
    # then try 12 (2x2x3) combinations with bootstrap toggled as well
    {'bootstrap': [False, True], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_clf = RandomForestClassifier(random_state=42)
# train across 5 folds: (12+12)*5 = 120 rounds of training in total
grid_search = GridSearchCV(forest_clf, param_grid, cv=5, n_jobs=1)
grid_search.fit(X_loan_data,y_loan_data)
# The best hyperparameter combinations
grid_search.best_params_
pred= grid_search.predict(X_test)
from sklearn import metrics as m
# Accuracy
acc= m.accuracy_score(y_test,pred)
acc
# # RandomizedSearchCV
# from sklearn.model_selection import RandomizedSearchCV
# from scipy.stats import randint
# param_distribs = {
#     'n_estimators': randint(low=1, high=200),
#     'max_features': randint(low=1, high=8),
# }
# # Use a classifier and a classification metric, since Loan_Status is categorical
# forest_clf = RandomForestClassifier(random_state=42)
# rnd_search = RandomizedSearchCV(forest_clf, param_distributions=param_distribs,
#                                 n_iter=10, cv=5, scoring='accuracy', random_state=42)
# rnd_search.fit(X_loan_data, y_loan_data)
Congratulations! You already know quite a lot about Machine Learning. :)
Analytics Vidhya : https://www.analyticsvidhya.com/blog/
Kaggle Winners Interview : http://blog.kaggle.com/category/winners-interviews/